Abstract. Consider a regression model with fixed design and Gaussian noise
where the regression function can potentially be well approximated by a
function that admits a sparse representation in a given dictionary. This
paper resorts to exponential weights to exploit this underlying sparsity by
implementing the principle of sparsity pattern aggregation. This model
selection take on sparse estimation allows us to derive sparsity oracle
inequalities in several popular frameworks including ordinary sparsity, fused
sparsity and group sparsity. One striking aspect of these theoretical results
is that they hold under no condition on the dictionary. Moreover, we
describe an efficient implementation of the sparsity pattern aggregation
principle that compares favorably to state-of-the-art procedures on some
basic numerical examples.
1. INTRODUCTION

Since the 1990s, the idea of exponential weighting has been successfully used
in a variety of statistical problems. In this paper, we review several properties
of estimators based on exponential weighting with a particular emphasis on how
they can be used to construct optimal and computationally efficient procedures
for high-dimensional regression under the sparsity scenario.

Most of the work on exponential weighting deals with a regression learning
problem. Some of the results can be extended to other statistical models such
as density estimation or classification, cf. Section 6. For the sake of brevity and
to make the presentation more transparent, we focus here on the following
framework considered in Rigollet and Tsybakov (2011). Let
$Z = \{(x_1, Y_1), \dots, (x_n, Y_n)\}$ be a collection of independent random couples
such that $(x_i, Y_i) \in \mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is an
arbitrary set. Assume the regression model:

$$Y_i = \eta(x_i) + \xi_i, \quad i = 1, \dots, n, \tag{1.1}$$

where $\eta : \mathcal{X} \to \mathbb{R}$ is the unknown regression function and the
errors $\xi_i$ are independent Gaussian $\mathcal{N}(0, \sigma^2)$. The covariates are
deterministic elements $x_1, \dots, x_n$ of $\mathcal{X}$. For any function
$f : \mathcal{X} \to \mathbb{R}$ we define a seminorm $\|\cdot\|$ by^1

$$\|f\|^2 = \frac{1}{n} \sum_{i=1}^n f^2(x_i).$$

We adopt the following learning setup. Let $\mathcal{H} = \{f_1, \dots, f_M\}$ be a
dictionary of $M \ge 1$ given functions. For example, $f_j$ can be some basis functions
or some preliminary estimators of $\eta$ constructed from another sample that we
consider as frozen (see Section 4 for more details). Our goal is to approximate the
regression function $\eta$ by a linear combination
$f_\theta(x) = \sum_{j=1}^M \theta_j f_j(x)$ with weights
$\theta = (\theta_1, \dots, \theta_M)$, where possibly $M \gg n$. The performance of a
given estimator $\hat f$ of a function $\eta$ is measured in terms of its averaged
squared error

$$R(\hat f) = \|\hat f - \eta\|^2 = \frac{1}{n} \sum_{i=1}^n \big[\hat f(x_i) - \eta(x_i)\big]^2 .$$

Let $\Theta$ be a given subset of $\mathbb{R}^M$. In the aggregation problem, we would
ideally wish to find an aggregated estimator $\hat f$ whose risk $R(\hat f)$ is as close
as possible in a probabilistic sense to the minimum risk
$\inf_{\theta \in \Theta} R(f_\theta)$. Namely, one can construct estimators $\hat f$
satisfying the following property:

$$\mathbb{E} R(\hat f) \le C \inf_{\theta \in \Theta} R(f_\theta) + \delta_{n,M}(\Theta), \tag{1.2}$$

where $\delta_{n,M}(\Theta)$ is a small remainder term characterizing the performance of
the given aggregate $\hat f$ and the complexity of the set $\Theta$, $C \ge 1$ is a
constant, and $\mathbb{E}$ denotes the expectation. Bounds of the form (1.2) are called
oracle inequalities. In some cases, even more general results are available. They have
the form

$$\mathbb{E} R(\hat f) \le C \inf_{\theta \in \Theta'} \big\{ R(f_\theta) + \Delta_{n,M}(\theta) \big\}, \tag{1.3}$$

where $\Delta_{n,M}(\theta)$ is a remainder term that characterizes the performance of the
given aggregate $\hat f$ and the complexity of the parameter
$\theta \in \Theta' \subset \mathbb{R}^M$ (often $\Theta' = \mathbb{R}^M$). To distinguish
from (1.2), we will call bounds of the form (1.3) balanced oracle inequalities. If
$\Theta \subset \Theta'$ then (1.2) is a direct consequence of (1.3) with

$$\delta_{n,M}(\Theta) = \sup_{\theta \in \Theta} \Delta_{n,M}(\theta).$$

In this paper, we mainly focus on the case where the complexity of a vector $\theta$ is
measured as the number of its nonzero coefficients $|\theta|_0$. In this case, inequalities
of the form (1.3) are sometimes called sparsity oracle inequalities. Other measures
of complexity, also related to sparsity, are considered in Subsection 5.2. As
indicated by the notation and illustrated below, the remainder term
$\Delta_{n,M}(\theta)$ depends explicitly on the size $M$ of the dictionary and the sample
size $n$. It reflects the interplay between these two fundamental parameters and also the
complexity of $\theta$.

^1 Without loss of generality, in what follows we will associate all the functions with
vectors in $\mathbb{R}^n$ since only the values of functions at points $x_1, \dots, x_n$
will appear in the risk. So, $\|\cdot\|$ will be indeed a norm and, with no ambiguity, we
will use other related notation such as $\|Y - f\|$ where $Y$ is a vector in
$\mathbb{R}^n$ with components $Y_1, \dots, Y_n$.

When the linear model is misspecified, that is, when there is no $\theta \in \Theta$ such
that $\eta = f_\theta$ on the set $\{x_1, \dots, x_n\}$, the minimum risk satisfies
$\inf_{\theta \in \Theta} R(f_\theta) > 0$, leading to a systematic bias term. Since this
term is unavoidable, we wish to make its contribution as small as possible and it is
therefore important to obtain a leading constant $C = 1$. Many oracle inequalities with
leading constant $C > 1$ can be found in the literature for related problems. However, in
most of the papers, the set $\Theta = \Theta_n$ depends on the sample size $n$ in such a
way that $\inf_{\theta \in \Theta_n} R(f_\theta)$ tends to 0 as $n$ goes to infinity,
under additional regularity assumptions. In this paper, we are interested in the case
where $\Theta$ is fixed. For this reason, we consider here only oracle inequalities with
leading constant $C = 1$ (the so-called exact or sharp oracle inequalities). Because they
hold for finite $M$ and $n$, these are truly finite sample results.

One salient feature of the oracle approach, as opposed to standard statistical
reasoning, is that it does not rely on an underlying model. Indeed, the goal is
not to estimate the parameters of an underlying "true" model but rather to 
construct an estimator that mimics, in terms of an appropriate oracle inequality, 
the performance of the best model in a given class, whether this model is true or 
not. From a statistical viewpoint, this difference is significant since performance 
cannot be evaluated in terms of parameters. Indeed, there is no true parameter. 
However, we can still compare the risk of the estimator with the optimum value 
and oracle inequalities offer a framework for such a comparison. 

A particular choice of $\Theta$ corresponds to the problem of model selection. Let
$\Theta = \Theta^{\mathrm{MS}}$ be the set of canonical vectors of $\mathbb{R}^M$. Then
the set of linear combinations $\{f_\theta, \theta \in \Theta^{\mathrm{MS}}\}$ coincides
with the initial dictionary of functions $\mathcal{H} = \{f_1, \dots, f_M\}$, so that the
goal of model selection is to mimic the best function in the dictionary in the sense of
the risk measure $R(\cdot)$. This can be done in different ways leading to different rates
$\delta_{n,M}(\Theta^{\mathrm{MS}})$; however, one is mostly interested in the methods
that attain the optimal rate, known to be
$\delta^*_{n,M}(\Theta^{\mathrm{MS}}) \asymp (\log M)/n$ (see Tsybakov, 2003; Bunea,
Tsybakov and Wegkamp, 2007; Rigollet, 2009). Catoni (1999) was the first to show that, for
model selection in Gaussian regression with random design, oracle inequalities of the form
(1.2) with the optimal rate $(\log M)/n$ are satisfied for the progressive mixture method
based on exponential weighting (see also Catoni, 2004). Other popular methods for model
selection include AIC, BIC or Mallows' $C_p$. They all consist in selecting a function in
the dictionary by minimizing a penalized empirical risk. One of the major novelties
offered by exponential weighting is to combine (average) the functions in the dictionary
using a convex combination and not simply to select one of them. From the theoretical
point of view, selection of one of the functions has a fundamental drawback since it does
not attain the optimal rate $(\log M)/n$ (cf. Section 2).

The rest of the paper is organized as follows. In the next section, we discuss 
some connections between the exponential weighting schemes and penalized em- 
pirical risk minimization. In Section 3, we present the first oracle inequalities 
that demonstrate how exponential weighting can be used to efficiently combine 
functions in a dictionary. The results of Section 3 are then extended to the case 
where one wishes to combine not deterministic functions but a special kind of
estimators. Balanced oracle inequalities for this problem are derived in Section 4. 
Section 5 shows how these results can be adapted to deal with sparsity. We intro- 
duce the principle of sparsity pattern aggregation, and we derive sparsity oracle 
inequalities in several popular frameworks including ordinary sparsity, fused spar- 
sity and group sparsity. Finally, we describe an efficient implementation of the 
sparsity pattern aggregation principle and compare its performance to state-of- 
the-art procedures on some basic numerical examples. 

2. EXPONENTIAL WEIGHTING AND PENALIZED RISK MINIMIZATION 

2.1 Suboptimality of selectors 

A natural candidate to solve the problem of model selection introduced in the
previous section is an empirical risk minimizer. Define the empirical risk by

$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2 = \|Y - f\|^2$$

and the empirical risk minimizer by

$$\hat f^{\mathrm{erm}} = \arg\min_{f \in \mathcal{H}} R_n(f), \tag{2.1}$$

where ties are broken arbitrarily. However, while this procedure satisfies an exact
oracle inequality, it fails to exhibit the optimal remainder term of order
$\delta^*_{n,M}(\Theta^{\mathrm{MS}}) \asymp (\log M)/n$. The following result shows that
this defect is intrinsic not only to empirical risk minimization but also to any method
that selects only one function in the dictionary $\mathcal{H}$. This includes traditional
methods of model selection by penalized empirical risk minimization such as AIC and BIC.
We call estimators $S_n$ taking values in $\mathcal{H}$ selectors.
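To make the definitions concrete, the following minimal Python sketch computes the empirical risk $R_n$ and an empirical risk minimizer over a small dictionary. The data and the three dictionary elements are made up for illustration, and each function is represented by its vector of values at the design points; none of this comes from the paper's experiments.

```python
def empirical_risk(y, f_values):
    """R_n(f) = (1/n) * sum_i (Y_i - f(x_i))^2, with f given by its values f(x_i)."""
    n = len(y)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f_values)) / n


def erm_selector(y, dictionary_values):
    """Index of a dictionary element minimizing R_n (first minimizer on ties)."""
    risks = [empirical_risk(y, f) for f in dictionary_values]
    return min(range(len(risks)), key=risks.__getitem__)


# Toy data (hypothetical): n = 4 observations, M = 3 dictionary functions.
y = [1.0, 2.0, 3.0, 4.0]
dictionary_values = [
    [0.0, 0.0, 0.0, 0.0],  # f_1
    [1.0, 2.0, 3.0, 5.0],  # f_2
    [2.0, 2.0, 2.0, 2.0],  # f_3
]
risks = [empirical_risk(y, f) for f in dictionary_values]
best = erm_selector(y, dictionary_values)
```

Such a procedure is a selector in the sense above: its output is always one of the $f_j$, never a mixture of them.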

Theorem 2.1. Assume that $\|f_j\| \le 1$ for any $f_j \in \mathcal{H}$. Any empirical
risk minimizer $\hat f^{\mathrm{erm}}$ defined in (2.1) satisfies the following oracle
inequality

$$\mathbb{E} R(\hat f^{\mathrm{erm}}) \le \min_{1 \le j \le M} R(f_j) + 4\sigma \sqrt{\frac{2 \log M}{n}}. \tag{2.2}$$

Moreover, assume that

$$(\sigma \vee 1) \sqrt{\frac{\log M}{n}} \le C_0 \tag{2.3}$$

for $0 < C_0 < 1$ small enough. Then, for any selector $S_n$, and in particular, for any
selector based on penalized empirical risk minimization, there exist a regression
function $\eta$ and a dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$ such that
$\|\eta\| \le 1$, $\|f_j\| \le 1$ for any $f_j \in \mathcal{H}$ and

$$\mathbb{E} R(S_n) \ge \min_{1 \le j \le M} R(f_j) + C_* \sigma \sqrt{\frac{\log M}{n}}, \tag{2.4}$$

for some positive constant $C_*$.
Proof. See appendix.

It follows from the lower bound (2.4) that selecting one of the functions in
a finite dictionary $\mathcal{H}$ to solve the problem of model selection is suboptimal
in the sense that it exhibits a too large remainder term, of the order
$\sqrt{(\log M)/n}$. Indeed, one can do better if we take a mixture, that is, a convex
combination of the functions in $\mathcal{H}$. We will see in Section 3, cf. (3.4), that
under a particular choice of weights in this convex combination, namely the exponential
weights, one can achieve oracle inequalities with the much better rate $(\log M)/n$. This
rate is known to be optimal in a minimax sense in several regression setups including the
present one (see Tsybakov, 2003; Bunea, Tsybakov and Wegkamp, 2007; Rigollet, 2009).

2.2 Exponential weighting as a penalized procedure 

Penalized empirical risk minimization for model selection has received a lot 
of attention in the literature and many choices for the penalty can be consid- 
ered (see, e.g., Birge and Massart, 2001; Bartlett, Boucheron and Lugosi, 2002; 
Lugosi and Wegkamp, 2004; Bunea, Tsybakov and Wegkamp, 2007) to obtain or- 
acle inequalities with the optimal remainder term. However, all these inequalities 
exhibit a constant C > 1 in front of the leading term. This is not surprising as we 
have proved in the previous section that it is impossible for selectors to satisfy 
sharp oracle inequalities like (1.2) with the optimal remainder term. To overcome 
this limitation of selectors, we look for convex combinations of the functions in 
the dictionary. 

Let $\Lambda^M$ denote the flat simplex of $\mathbb{R}^M$ defined by

$$\Lambda^M = \Big\{ \lambda \in \mathbb{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j = 1 \Big\}.$$

Let us now examine a few ways to obtain potentially good convex combinations.
One candidate is a solution of the following penalized empirical risk minimization
problem:

$$\min_{\lambda \in \Lambda^M} \big\{ R_n(f_\lambda) + \mathrm{pen}(\lambda) \big\},$$

where $\mathrm{pen}(\cdot) \ge 0$ is a penalty function. This choice looks quite natural
since it provides a proxy of the right-hand side of the oracle inequality (1.3) where the
unknown risk $R(\cdot)$ is replaced by its empirical counterpart $R_n(\cdot)$. The minimum
is taken over the simplex $\Lambda^M$ because we are looking for a convex combination.
Clearly, the penalty $\mathrm{pen}(\cdot)$ should be carefully chosen and ideally should
match the best remainder term $\Delta_{n,M}(\cdot)$. Yet, this problem may be difficult to
solve as it involves a minimization over $\Lambda^M$. Instead, we propose to solve a
simpler problem. Consider the following linear upper bound on the empirical risk, which
follows from the convexity of $f \mapsto R_n(f)$:

$$\sum_{j=1}^M \lambda_j R_n(f_j) \ge R_n(f_\lambda), \quad \lambda \in \Lambda^M,$$

and solve the following optimization problem

$$\min_{\lambda \in \Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j R_n(f_j) + \mathrm{pen}(\lambda) \Big\}. \tag{2.5}$$
Note that if $\mathrm{pen} \equiv 0$, the solution $\hat\lambda$ of (2.5) is simply the
empirical risk minimizer over the vertices of the simplex so that
$f_{\hat\lambda} = \hat f^{\mathrm{erm}}$. In general, depending on the penalty function,
this problem may be more or less difficult to solve. It turns out that the
Kullback-Leibler penalty leads to a particularly simple solution and allows us to
approximate the best remainder term $\Delta_{n,M}(\cdot)$ thus adding great flexibility to
the resulting estimator.

Observe that vectors in $\Lambda^M$ can be associated to probability measures on
$\{1, \dots, M\}$. Let $\lambda = (\lambda_1, \dots, \lambda_M)$ and
$\pi = (\pi_1, \dots, \pi_M)$ be two probability measures on $\{1, \dots, M\}$ and define
the Kullback-Leibler divergence between $\lambda$ and $\pi$ by

$$\mathcal{K}(\lambda, \pi) = \sum_{j=1}^M \lambda_j \log\Big(\frac{\lambda_j}{\pi_j}\Big) \ge 0.$$

Here and in the sequel, we adopt the convention that $0 \log 0 = 0$,
$0 \log(a/0) = 0$, and $\log(a/0) = \infty$, for any $a > 0$.

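Exponential weights can be obtained as the solution of the following minimization
problem. Fix $\beta > 0$, a prior $\pi \in \Lambda^M$ and define the vector
$\lambda^\pi$ by

$$\lambda^\pi = \arg\min_{\lambda \in \Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j R_n(f_j) + \frac{\beta}{n} \mathcal{K}(\lambda, \pi) \Big\}. \tag{2.6}$$

This constrained convex optimization problem has a unique solution that can
be expressed explicitly. Indeed, it follows from the Karush-Kuhn-Tucker (KKT)
conditions that the components $\lambda^\pi_j$ of $\lambda^\pi$ satisfy

$$n R_n(f_j) + \beta \log\Big(\frac{\lambda^\pi_j}{\pi_j}\Big) + \mu - \delta_j = 0, \quad j = 1, \dots, M, \tag{2.7}$$

where $\mu, \delta_1, \dots, \delta_M \ge 0$ are Lagrange multipliers, and

$$\lambda^\pi_j \ge 0, \quad \delta_j \lambda^\pi_j = 0, \quad \sum_{j=1}^M \lambda^\pi_j = 1.$$

Equation (2.7) together with the above constraints leads to the following closed
form solution:

$$\lambda^\pi_j = \frac{\exp(-n R_n(f_j)/\beta)\, \pi_j}{\sum_{k=1}^M \exp(-n R_n(f_k)/\beta)\, \pi_k}, \quad j = 1, \dots, M, \tag{2.8}$$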



called the exponential weights. We see that one immediate effect of penalizing by 
the Kullback-Leibler divergence is that the solution of (2.6) is not a selector. As 
a result, it achieves the desired effect of averaging as opposed to selecting. 
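The closed form (2.8) is easy to check numerically. The sketch below is an illustration under made-up inputs (three hypothetical empirical risks, a uniform prior, $n = 4$ and $\beta = 4$), not the authors' implementation: it computes the exponential weights in a numerically stable way and evaluates the criterion of (2.6), so that the weights can be compared with other points of the simplex.

```python
import math


def exponential_weights(emp_risks, prior, n, beta):
    """Weights (2.8): lambda_j proportional to exp(-n R_n(f_j) / beta) * pi_j."""
    # Work on the log scale and subtract the maximum for numerical stability.
    logs = [
        (-n * r / beta + math.log(p)) if p > 0 else -math.inf
        for r, p in zip(emp_risks, prior)
    ]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl(lam, prior):
    """Kullback-Leibler divergence with the conventions 0 log 0 = 0, log(a/0) = inf."""
    out = 0.0
    for l, p in zip(lam, prior):
        if l > 0:
            out += math.inf if p == 0 else l * math.log(l / p)
    return out


def objective(lam, emp_risks, prior, n, beta):
    """Criterion of (2.6): sum_j lam_j R_n(f_j) + (beta / n) * KL(lam, prior)."""
    linear = sum(l * r for l, r in zip(lam, emp_risks))
    return linear + beta / n * kl(lam, prior)


# Hypothetical inputs chosen only for illustration.
emp_risks = [7.5, 0.25, 1.5]
prior = [1.0 / 3] * 3
n, beta = 4, 4.0
weights = exponential_weights(emp_risks, prior, n, beta)
value_at_weights = objective(weights, emp_risks, prior, n, beta)
```

On this example the criterion at `weights` is smaller than at the uniform vector or at any vertex of the simplex, and it equals $-(\beta/n) \log \sum_j \pi_j \exp(-n R_n(f_j)/\beta)$, as a direct substitution of (2.8) into (2.6) shows.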

3. ORACLE INEQUALITIES 

An aggregate is an estimator defined as a weighted average of the functions in
the dictionary $\mathcal{H}$ with some data-dependent weights. We focus on the aggregate
with exponential weights:

$$\hat f^\pi = \sum_{j=1}^M \lambda^\pi_j f_j,$$

where $\lambda^\pi_j$ is given in (2.8). This estimator satisfies the following balanced
oracle inequality.

Theorem 3.1. The aggregate $\hat f^\pi$ with $\beta \ge 4\sigma^2$ satisfies the
following balanced oracle inequality

$$\mathbb{E} R(\hat f^\pi) \le \min_{\lambda \in \Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j R(f_j) + \frac{\beta}{n} \mathcal{K}(\lambda, \pi) \Big\}. \tag{3.1}$$

The proof can be found in the papers of Dalalyan and Tsybakov (2007, 2008)
containing more general results. In particular, they apply to non-Gaussian
distributions of errors and to exponential weights with a general (not necessarily
discrete) probability distribution $\pi$ on $\mathbb{R}^M$. Dalalyan and Tsybakov
(2007, 2008) show that the corresponding exponentially weighted aggregate $\hat f^\pi$
satisfies the following bound

$$\mathbb{E} R(\hat f^\pi) \le \inf_{p} \Big\{ \int R(f_\theta)\, p(d\theta) + \frac{\beta}{n} \mathcal{K}(p, \pi) \Big\}, \tag{3.2}$$

where the infimum is taken over all probability distributions $p$ on $\mathbb{R}^M$ and
$\mathcal{K}(p, \pi)$ denotes the Kullback-Leibler divergence between the general
probability measures $p$ and $\pi$. The bound (3.1) follows immediately from (3.2) by
taking $p$ and $\pi$ as discrete distributions.

A useful consequence of (3.1) can be obtained by restricting the minimum on
the right-hand side to the vertices of the simplex $\Lambda^M$. These vertices are
precisely the vectors $e^{(1)}, \dots, e^{(M)}$ that form the canonical basis of
$\mathbb{R}^M$ so that

$$\sum_{j=1}^M e^{(k)}_j R(f_j) = R(f_k).$$

It yields

$$\mathbb{E} R(\hat f^\pi) \le \min_{1 \le k \le M} \Big\{ R(f_k) + \frac{\beta}{n} \log(\pi_k^{-1}) \Big\}. \tag{3.3}$$

Taking $\pi$ to be the uniform distribution on $\{1, \dots, M\}$ leads to the following
oracle inequality

$$\mathbb{E} R(\hat f^\pi) \le \min_{1 \le k \le M} R(f_k) + \frac{\beta}{n} \log M, \tag{3.4}$$

which exhibits a remainder term of the optimal order $(\log M)/n$.
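The step from (3.1) to (3.3) rests on the fact that the Kullback-Leibler divergence between a vertex $e^{(k)}$ and the prior reduces to $\log(\pi_k^{-1})$. The short check below illustrates this with a uniform prior on a hypothetical dictionary of size $M = 8$; the helper name `kl_vertex` is introduced here for illustration only.

```python
import math


def kl_vertex(k, prior):
    """KL divergence between the k-th vertex e^(k) of the simplex and the prior.

    With the convention 0 log 0 = 0, only the k-th term of the sum survives,
    which leaves log(1 / pi_k).
    """
    vertex = [1.0 if j == k else 0.0 for j in range(len(prior))]
    return sum(v * math.log(v / p) for v, p in zip(vertex, prior) if v > 0)


M = 8  # hypothetical dictionary size
uniform = [1.0 / M] * M
penalties = [kl_vertex(k, uniform) for k in range(M)]
```

Every entry of `penalties` equals $\log M$, which is the term multiplied by $\beta/n$ in (3.4).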

The role of the distribution $\pi$ is to put a prior weight on the functions in the
dictionary. When there is no preference, the uniform prior is quite a common 
choice. However, we will see in Section 5 that choosing non-uniform weights de- 
pending on suitable sparsity characteristics can be very useful. Moreover, this 
methodology can be extended to many cases where one wishes to learn with 
a prior. It is worth mentioning that while the terminology is reminiscent of a 
Bayesian setup, this paper deals only with a frequentist setting (the risk is not 
averaged over the prior). 
4. AGGREGATION OF ESTIMATORS

4.1 Sample splitting 

Akin to the setting of the previous section, exponential weights were originally 
introduced to aggregate deterministic functions from a dictionary. These func- 
tions can be chosen in essentially two ways. Either they have good approximation 
properties such as an (over-complete) basis of functions or they are constructed 
as preliminary estimators using a held-out sample. The latter case corresponds 
to the problem of aggregation of estimators originally described in Nemirovski 
(2000). The idea put forward by Nemirovski and later extended to several
scenarios (see, e.g., Yang, 2004; Rigollet and Tsybakov, 2007; Lecué, 2007) consists
in splitting the sample at hand into two parts^2. The first part is used to
construct estimators while the second is used to perform aggregation, in particular
to construct exponential weights. To carry out the analysis, it is standard to work 
conditionally on the first sample so that the problem is equivalent to working with 
a deterministic dictionary of functions. 

Nevertheless, the idea of sample splitting does not carry over to independent 
samples that are not identically distributed as in the present setup. Indeed, the 
observations in the first sample no longer have the same distribution as those in 
the second sample. To overcome this limitation, one would like to aggregate esti- 
mators using the same observations for both estimation and aggregation. While 
for general estimators this approach would clearly lead to overfitting, it has been 
shown that it yields good oracle inequalities for certain types of estimators, first 
for projection estimators (Leung and Barron, 2006) and more recently for a more 
general class of linear (affine) estimators (Dalalyan and Salmon, 2011).

4.2 Aggregation of linear estimators 

Suppose that we are given a finite family $\{\hat f_1, \dots, \hat f_K\}$ of linear
estimators defined by

$$\hat f_j(x) = Y^\top a_j(x), \tag{4.1}$$

where $a_j(\cdot)$ are given functions with values in $\mathbb{R}^n$. This representation
is quite general; for example, $\hat f_j$ can be ordinary least squares, (kernel) ridge
regression estimators, diagonal linear filter estimators etc. (see Kneip, 1994; Dalalyan
and Salmon, 2011, for a longer list of relevant examples). The vector of values
$(\hat f_j(x_i), i = 1, \dots, n)$ equals $A_j Y$, where $A_j$ is an $n \times n$ matrix
with rows $a_j(x_i)^\top$, $i = 1, \dots, n$.

Now, we would like to consider mixtures of such estimators rather than mixtures
of deterministic functions as in the previous sections. For this purpose,
exponential weights have to be slightly modified. Indeed, note that in Section 2, the
risk of a deterministic function $f_j$ is simply estimated by the empirical risk
$R_n(f_j)$, which is plugged into the expression for the weights. Clearly,
$\mathbb{E} R_n(f_j) = R(f_j) + \sigma^2$, so that $R_n(f_j)$ is, up to the additive
constant $\sigma^2$ that does not depend on $j$, an unbiased estimator of the risk
$R(f_j)$ of $f_j$. For a linear estimator $\hat f_j$ defined in (4.1), $R_n(\hat f_j)$ is
no longer an unbiased estimator of the risk $R(\hat f_j)$. It is well known that the risk
of the linear estimator $\hat f_j$ has the form

$$\mathbb{E} R(\hat f_j) = \|(A_j - I)\eta\|^2 + \frac{\sigma^2}{n} \mathrm{Tr}[A_j^\top A_j],$$

where $\mathrm{Tr}[A]$ denotes the trace of a matrix $A$, and $I$ denotes the
$n \times n$ identity matrix. Moreover, an unbiased estimator of
$\mathbb{E} R(\hat f_j)$ is given by a version of Mallows' $C_p$:

$$\hat R^{\mathrm{unb}}_n(\hat f_j) = \|Y - \hat f_j\|^2 + \frac{2\sigma^2}{n} \mathrm{Tr}[A_j] - \sigma^2. \tag{4.2}$$

^2 More precisely, Nemirovski (2000) considered a randomized procedure rather than sample
splitting. The two samples were obtained from the original one by a randomization and the
argument was restricted to the Gaussian model.



Then, for linear estimators, the exponential weights and the corresponding
aggregate are modified as follows:

$$\lambda^\pi_j = \frac{\exp(-n \hat R^{\mathrm{unb}}_n(\hat f_j)/\beta)\, \pi_j}{\sum_{k=1}^K \exp(-n \hat R^{\mathrm{unb}}_n(\hat f_k)/\beta)\, \pi_k}, \qquad \hat f^\pi = \sum_{j=1}^K \lambda^\pi_j \hat f_j. \tag{4.3}$$

Note that for deterministic $f_j$, we naturally define
$\hat R^{\mathrm{unb}}_n(f_j) = R_n(f_j)$, so that definition (4.3) remains consistent
with (2.8). With this more general definition of exponential weights, Dalalyan and Salmon
(2011) prove the following risk bounds for the aggregate $\hat f^\pi$.
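As an illustration of the quantity plugged into the weights (4.3), here is a minimal sketch of the unbiased risk estimate (4.2) for a linear estimator given by its matrix $A_j$. The function names and the toy inputs (four observations, $\sigma^2 = 1$ assumed known) are hypothetical choices made for this example.

```python
def matvec(a, y):
    """Product A y for a matrix stored as a list of rows."""
    return [sum(aij * yj for aij, yj in zip(row, y)) for row in a]


def unbiased_risk(a, y, sigma2):
    """Estimate (4.2): ||Y - A Y||^2 + (2 sigma^2 / n) Tr[A] - sigma^2,

    where ||v||^2 = (1/n) sum_i v_i^2 is the normalized norm of the paper.
    """
    n = len(y)
    fitted = matvec(a, y)
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
    trace = sum(a[i][i] for i in range(n))
    return residual + 2.0 * sigma2 * trace / n - sigma2


# Toy example (hypothetical values); sigma^2 is assumed known.
y = [1.0, 2.0, 3.0, 4.0]
n = len(y)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
averaging = [[1.0 / n] * n for _ in range(n)]  # projector onto constant vectors

risk_identity = unbiased_risk(identity, y, sigma2=1.0)
risk_averaging = unbiased_risk(averaging, y, sigma2=1.0)
```

For the identity matrix the estimator is $\hat f = Y$, whose risk is exactly $\sigma^2$; the estimate returns this value regardless of the data, which is a quick sanity check of the trace correction.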

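Theorem 4.1. Let $\{\hat f_1, \dots, \hat f_K\}$ be a family of linear estimators defined
in (4.1) such that the matrices $A_j$ are symmetric, positive definite, and
$A_j A_k = A_k A_j$, for all $1 \le j, k \le K$. Then the exponentially weighted aggregate
$\hat f^\pi$ defined in (4.3) with $\beta \ge 8\sigma^2$ satisfies

$$\mathbb{E} R(\hat f^\pi) \le \min_{\lambda \in \Lambda^K} \Big\{ \sum_{j=1}^K \lambda_j \mathbb{E} R(\hat f_j) + \frac{\beta}{n} \mathcal{K}(\lambda, \pi) \Big\}, \tag{4.4}$$

$$\mathbb{E} R(\hat f^\pi) \le \min_{1 \le j \le K} \Big\{ \mathbb{E} R(\hat f_j) + \frac{\beta}{n} \log(\pi_j^{-1}) \Big\}. \tag{4.5}$$

If all the $A_j$ are projection matrices ($A_j^\top = A_j$, $A_j^2 = A_j$), then the
above inequalities hold with $\beta \ge 4\sigma^2$.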

Here, the bound (4.5) follows immediately from (4.4). In the rest of the paper, 
we mainly use the last part of this theorem concerning projection estimators. The 
bound (4.5) for this particular case was originally proved in Leung and Barron 
(2006). The result of Dalalyan and Salmon (2011) is, in fact, more general than 
Theorem 4.1 covering non-discrete priors in the spirit of (3.2), and it applies not 
only to linear but also to affine estimators fj. 

5. SPARSE ESTIMATION 

A family of projection estimators that we consider in this section is the family
of all $2^M$ least squares estimators, each of which is characterized by its sparsity
pattern. We examine properties of these estimators, and show that their mixtures
with exponential weights satisfy powerful sparsity oracle inequalities for suitably
chosen priors $\pi$.

5.1 Sparsity pattern aggregation 

Assume that we are given a dictionary of functions $\mathcal{H} = \{f_1, \dots, f_M\}$.
However, we will not aggregate the elements of the dictionary but rather least squares
estimators depending on all the $f_j$. We denote by $X$ the $n \times M$ design matrix
with elements $X_{i,j} = f_j(x_i)$, $i = 1, \dots, n$, $j = 1, \dots, M$.

A sparsity pattern is a binary vector $p \in \mathcal{P} := \{0, 1\}^M$. The terminology
comes from the fact that the coordinates of any such vector can be interpreted as
indicators of presence ($p_j = 1$) or absence ($p_j = 0$) of a given feature indexed by
$j \in \{1, \dots, M\}$. We denote by $|p|$ the number of ones in the sparsity pattern $p$
and by $\mathbb{R}^p$ the space defined by

$$\mathbb{R}^p = \{\theta \cdot p : \theta \in \mathbb{R}^M\} \subset \mathbb{R}^M,$$

where $\theta \cdot p \in \mathbb{R}^M$ denotes the Hadamard product between $\theta$ and
$p$ and is defined as the vector with coordinates given by
$(\theta \cdot p)_j = \theta_j p_j$, $j = 1, \dots, M$.
For any $p \in \mathcal{P}$, let $\hat\theta_p$ be any least squares estimator defined by

$$\hat\theta_p \in \arg\min_{\theta \in \mathbb{R}^p} \|Y - f_\theta\|^2 \quad \text{with} \quad f_\theta = \sum_{j=1}^M \theta_j f_j. \tag{5.1}$$

The following simple lemma gives an oracle inequality for the least squares
estimator. It follows easily from the Pythagorean theorem. Moreover, the random
variables $\xi_1, \dots, \xi_n$ need not be Gaussian for the result to hold.

Lemma 5.1. Fix $p \in \mathcal{P}$. Then any least squares estimator $\hat\theta_p$
defined in (5.1) satisfies

$$\mathbb{E} \|f_{\hat\theta_p} - \eta\|^2 = \min_{\theta \in \mathbb{R}^p} \|f_\theta - \eta\|^2 + \sigma^2 \frac{R_p}{n} \le \min_{\theta \in \mathbb{R}^p} \|f_\theta - \eta\|^2 + \sigma^2 \frac{|p|}{n}, \tag{5.2}$$

where $R_p$ is the dimension of the linear subspace
$\{X\theta : \theta \in \mathbb{R}^p\}$.

Clearly, if $|p|$ is small compared to $n$, the oracle inequality gives a good
performance guarantee for the least squares aggregate $f_{\hat\theta_p}$. Nevertheless it
may be the case that the approximation error
$\min_{\theta \in \mathbb{R}^p} \|f_\theta - \eta\|^2$ is quite large. Hence, we are
looking for a sparsity pattern such that $|p|$ is small and that yields a least squares
aggregate with small approximation error. This is clearly a model selection problem as
described in Section 1.

Observe that for each sparsity pattern $p \in \mathcal{P}$, the function
$f_{\hat\theta_p}$ is a projection estimator of the form $f_{\hat\theta_p} = A_p Y$ where
the $n \times n$ matrix $A_p$ is the projector onto the linear span of
$\{f_j \in \mathcal{H} : p_j = 1\}$ (as above, we identify the functions $f_j, f_\theta$
with the vectors of their values at points $x_1, \dots, x_n$ since the risk depends only
on these values). Therefore $\mathrm{Tr}[A_p] = R_p$. We have seen in the previous section
that projection estimators can be aggregated to solve the problem of model selection
using exponential weights. Thus, instead of selecting the best sparsity pattern,
we resort to taking convex combinations leading to what is called sparsity pattern
aggregation. For any sparsity pattern $p \in \mathcal{P}$, define the exponential weights
$\lambda^\pi_p$ and the sparsity pattern aggregate $\hat f^\pi$ respectively by

$$\lambda^\pi_p = \frac{\exp(-n \hat R^{\mathrm{unb}}_n(f_{\hat\theta_p})/\beta)\, \pi_p}{\sum_{p' \in \mathcal{P}} \exp(-n \hat R^{\mathrm{unb}}_n(f_{\hat\theta_{p'}})/\beta)\, \pi_{p'}}, \qquad \hat f^\pi = \sum_{p \in \mathcal{P}} \lambda^\pi_p f_{\hat\theta_p},$$

where $\pi = (\pi_p)_{p \in \mathcal{P}}$ is a probability distribution (prior) on the set
of sparsity patterns $\mathcal{P}$.

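To study the performance of this method, we can now apply the last part of
Theorem 4.1 dealing with projection matrices. Combining (4.5) and Lemma 5.1
we get

$$\mathbb{E} R(\hat f^\pi) \le \inf_{\theta \in \mathbb{R}^M} \Big\{ \|f_\theta - \eta\|^2 + \sigma^2 \frac{|\theta|_0}{n} + \frac{\beta}{n} \log\big(\pi_{p(\theta)}^{-1}\big) \Big\}, \tag{5.3}$$

where $p(\theta) \in \mathcal{P}$ denotes the sparsity pattern of $\theta$ and is defined
as a vector with components $p_j(\theta) = 1$ if $\theta_j \ne 0$, and
$p_j(\theta) = 0$ otherwise.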
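For very small $M$, the sparsity pattern aggregate can be computed by brute force, which may help to fix ideas. The sketch below enumerates all $2^M$ patterns, fits least squares on each through a Gram-Schmidt projection, and mixes the fits with exponential weights built on the estimate (4.2). It is an illustration only: the uniform default prior, the fixed tolerance `tol`, the function names and the toy data are assumptions made here, and full enumeration is not the efficient implementation the paper refers to.

```python
import itertools
import math


def project(columns, y, tol=1e-10):
    """Least squares fit of y on span(columns) via Gram-Schmidt.

    Returns the fitted vector A_p y and the rank R_p = Tr[A_p].
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > tol:  # skip columns that are numerically dependent
            basis.append([vi / norm for vi in v])
    fitted = [0.0] * len(y)
    for q in basis:
        c = sum(yi * qi for yi, qi in zip(y, q))
        fitted = [fi + c * qi for fi, qi in zip(fitted, q)]
    return fitted, len(basis)


def sparsity_pattern_aggregate(columns, y, sigma2, beta, prior=None):
    """Mix all 2^M least squares fits with exponential weights (prior[p] > 0)."""
    n, m = len(y), len(columns)
    patterns = list(itertools.product((0, 1), repeat=m))
    if prior is None:  # assumption: uniform prior, used only for illustration
        prior = {p: 1.0 / len(patterns) for p in patterns}
    fits, log_w = {}, {}
    for p in patterns:
        active = [columns[j] for j in range(m) if p[j] == 1]
        fitted, rank = project(active, y)
        resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
        risk = resid + 2.0 * sigma2 * rank / n - sigma2  # estimate (4.2)
        fits[p] = fitted
        log_w[p] = -n * risk / beta + math.log(prior[p])
    top = max(log_w.values())
    total = sum(math.exp(v - top) for v in log_w.values())
    weights = {p: math.exp(v - top) / total for p, v in log_w.items()}
    aggregate = [sum(weights[p] * fits[p][i] for p in patterns) for i in range(n)]
    return aggregate, weights


# Toy design (hypothetical): columns hold the values f_j(x_1), ..., f_j(x_n).
columns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
y = [3.0, 0.1, -0.2, 0.05]
sigma2 = 0.01
aggregate, weights = sparsity_pattern_aggregate(columns, y, sigma2, beta=4 * sigma2)
```

The cost grows as $2^M$, so this enumeration is only usable for toy dictionaries; a sparsity-favoring prior can be passed through the `prior` argument as a dictionary indexed by pattern tuples.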

The remainder term in the balanced oracle inequality (5.3) depends on the
choice of the prior $\pi$. Several choices can be considered depending on the
information that we have about the oracle, i.e., about a potentially good candidate
$\theta$ that we would like to mimic. For example, we can assume that there exists a
good $\theta$ that is coordinatewise sparse, group sparse or even that $\theta$ is piecewise
constant. While this approach to structure the prior knowledge seems to fit in a 
Bayesian framework, we only pursue a frequentist setup. Indeed, our risk measure 
is not averaged over a prior. Such priors on good candidates for estimation are 
often used in a non-Bayesian framework. For example, in nonparametric estima- 
tion, it is usually assumed that a good candidate function is smooth. Without 
such assumptions, one may face difficulties in performing meaningful theoretical 
analysis. 

5.2 Sparsity priors 

5.2.1 Coordinatewise sparsity. This is the basic and most commonly used form
of sparsity. The prior $\pi$ should favor vectors $\theta$ that have a small number of
nonzero coordinates. Several priors have been suggested for this purpose, cf. Leung and
Barron (2006); Giraud (2007); Rigollet and Tsybakov (2011); Alquier and Lounici (2011).
We consider here yet another prior, close to that of Giraud (2007). The main
difference is that the prior below exponentially downweights sparsity patterns
with large $|p|$ whereas the prior in Giraud (2007) downweights such patterns only
polynomially. Define

[display missing in the source: definition of the prior weights $\pi_p$]

It can be easily seen that $\sum_{p \in \mathcal{P}} \pi_p = 1$ so that
$\pi = (\pi_p)_{p \in \mathcal{P}}$ is indeed a probability measure on $\mathcal{P}$.
Note that

[display (5.4) missing in the source]

where we have used the inequality $\binom{M}{k} \le \big(\frac{eM}{k}\big)^k$. Define the
sparsity pattern aggregate

$$\hat f = \sum_{p \in \mathcal{P}} \lambda_p f_{\hat\theta_p}, \tag{5.5}$$

where $\lambda_p$ is the exponential weight given in (4.3), computed with the prior
defined above, and $\hat\theta_p$ is the least squares estimator (5.1).
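The binomial inequality used above, $\binom{M}{k} \le (eM/k)^k$, bounds the number of sparsity patterns with exactly $k$ ones. A quick numerical check, written here as a hedged illustration with a hypothetical helper name, is:

```python
import math


def binomial_bound_holds(m):
    """Check that C(m, k) <= (e * m / k)^k for every 1 <= k <= m."""
    return all(
        math.comb(m, k) <= (math.e * m / k) ** k
        for k in range(1, m + 1)
    )


checked = all(binomial_bound_holds(m) for m in range(1, 60))
```

Taking logarithms, the inequality says that the log-number of patterns of size $k$ is at most $k \log(eM/k)$, which is the shape of the remainder terms in sparsity oracle inequalities.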

Plugging (5.4) into (5.3) with $\beta = 4\sigma^2$ yields the following sparsity oracle
inequality

(5.6) TER{n< inf\\\k-v\\' + —\0\olog('j^)+—]. 

It is important to note that (5.6) is valid under no assumption on the
dictionary. This is in contrast to the Lasso and assimilated penalized procedures that
are known to have similar properties only under strong conditions, such as
restricted isometry (see, e.g., Candès and Tao, 2007; Bickel, Ritov and Tsybakov,
2009; Koltchinskii, Lounici and Tsybakov, 2010).

Another choice for $\pi$ in the framework of coordinatewise sparsity can be found
in Rigollet and Tsybakov (2011) and yields the exponential screening estimator.
The exponential screening aggregate satisfies an improved version of the above
sparsity oracle inequality with $|\theta|_0$ replaced by $\min(|\theta|_0, R)$ where $R$
is the rank of the design matrix $X$. In particular, if the rank $R$ is small, the
exponential screening aggregate adapts to it. Moreover, it is shown in Rigollet and
Tsybakov (2011) that the remainder term of the oracle inequality is optimal in a minimax
sense.

5.2.2 Fused sparsity. When there exists a natural order among the functions
$f_1, \dots, f_M$ in the dictionary, it may be appropriate to assume that there exists
a piecewise constant $\theta \in \mathbb{R}^M$, with components taking only a small
number of values, that has good approximation properties. This idea has been exploited in
the image denoising literature for two decades, originating with the classical paper
by Rudin, Osher and Fatemi (1992). Recently, the fused Lasso was introduced
in Tibshirani et al. (2005) to deal with the same problem in one dimension instead
of two. Here we suggest another method that takes advantage of fused sparsity
using the idea of mixing with exponential weights. Its theoretical advantages are
demonstrated by the sparsity oracle inequality in Corollary 5.1 below.

At first sight, this problem appears to be different from the one considered
above since a good $\theta \in \mathbb{R}^M$ need not be sparse. Yet, the structural
assumption on $\theta$ can be reformulated into a coordinatewise sparsity assumption.
Indeed, let $D$ be the $M \times M$ matrix defined by the relations
$(D\theta)_1 = \theta_1$ and $(D\theta)_j = \theta_j - \theta_{j-1}$
for $j = 2, \dots, M$. For each sparsity pattern $p \in \mathcal{P}$, we consider the
least squares estimator

$$\hat\theta^D_p \in \arg\min_{\theta \in \mathbb{R}^M : \, D\theta \in \mathbb{R}^p} \|Y - f_\theta\|^2. \tag{5.7}$$

The corresponding estimator $f_{\hat\theta^D_p}$ (as previously, without loss of
generality we consider $f_{\hat\theta^D_p}$ as an $n$-vector) takes the form
$f_{\hat\theta^D_p} = A^D_p Y$, where $A^D_p$ is the projector onto the linear space
$\mathcal{L}_p = \{X\theta : D\theta \in \mathbb{R}^p\}$. In particular,
$\mathrm{Tr}[A^D_p] = R^D_p$, where $R^D_p$ is the dimension of $\mathcal{L}_p$.
Moreover, it is straightforward to obtain the following result, analogous to Lemma 5.1.

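Lemma 5.2. Fix $p \in \mathcal{P}$. Then any least squares estimator $\hat\theta^D_p$
defined in (5.7) with any invertible matrix $D$ satisfies

$$\mathbb{E} \|f_{\hat\theta^D_p} - \eta\|^2 = \min_{D\theta \in \mathbb{R}^p} \|f_\theta - \eta\|^2 + \sigma^2 \frac{R^D_p}{n} \le \min_{D\theta \in \mathbb{R}^p} \|f_\theta - \eta\|^2 + \sigma^2 \frac{|p|}{n}. \tag{5.8}$$

We are therefore in a position to apply the results from Section 4. If $p$ is sparse,
the least squares estimator $\hat\theta^D_p$ is piecewise constant with a small number
$|p|$ of jumps.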
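The reduction from piecewise constant vectors to sparse ones can be seen on a small example. The sketch below builds the first-difference matrix $D$ described above and applies it to a made-up piecewise constant $\theta$; the helper names are hypothetical and chosen for this illustration.

```python
from itertools import accumulate


def first_difference_matrix(m):
    """M x M matrix D with (D theta)_1 = theta_1, (D theta)_j = theta_j - theta_{j-1}."""
    d = [[0.0] * m for _ in range(m)]
    for j in range(m):
        d[j][j] = 1.0
        if j > 0:
            d[j][j - 1] = -1.0
    return d


def apply(d, theta):
    """Matrix-vector product D theta."""
    return [sum(dij * tj for dij, tj in zip(row, theta)) for row in d]


def l0(v):
    """Number of nonzero coordinates."""
    return sum(1 for vi in v if vi != 0)


theta = [2.0, 2.0, 2.0, 5.0, 5.0, 1.0]  # piecewise constant, hypothetical values
d_theta = apply(first_difference_matrix(len(theta)), theta)
recovered = list(accumulate(d_theta))  # D is invertible: cumulative sums undo it
```

Here $\theta$ has six nonzero coordinates while $D\theta$ has only three, one for the initial level and one per jump, so a sparsity prior on $D\theta$ favors piecewise constant $\theta$.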

Now, since the problem has been reduced to coordinatewise sparsity, we can
choose the prior $\pi$ to favor vectors $\theta \in \mathbb{R}^M$ that are piecewise
constant with a small number of jumps. Define the fused sparsity pattern aggregate by

$$\hat f = \sum_{p \in \mathcal{P}} \lambda_p f_{\hat\theta^D_p}, \tag{5.9}$$

where $\lambda_p$ is the exponential weight defined in (4.3) and $\hat\theta^D_p$ is the
least squares estimator defined in (5.7). Combining (4.5) with Lemma 5.2 in the same way
as in (5.3), and taking $\beta = 4\sigma^2$, we obtain the following bound.

Corollary 5.1. Let $D$ be an invertible matrix. The fused sparsity pattern
aggregate defined in (5.9) with $\beta = 4\sigma^2$ satisfies

(5.10) IEi?(r)< inf /||f,-r?f + ^|Z)0|ologf-^V^''' 



eeR'^' { n \\Dd\() J n 

To our knowledge, analogous bounds for fused Lasso are not available.
Furthermore, Corollary 5.1 holds under no assumption on the dictionary, which cannot
be the case for the Lasso type methods. Let us also emphasize that the corollary
is valid for any invertible matrix $D$ used to account for fused sparsity and not
only for the standard "first differences" $D$ defined above. Other choices of $D$ allow
for higher order differences, combinations of differences of several orders etc.



5.2.3 Group sparsity. Recently, estimation under group sparsity has been
intensively discussed in the literature. Starting from Yuan and Lin (2006), several
estimators have been studied, essentially the Group Lasso and some related penalized
techniques. Theoretical properties of the Group Lasso are treated in some
generality by Huang and Zhang (2010) and Lounici et al. (2010), where one can
find further references. Here we show that one can deal with group sparsity using
exponentially weighted aggregates. The new estimator that we propose presents
some theoretical advantages as compared to the Group Lasso type methods.

Let $B_1, \dots, B_K$ be a given partition of $\{1, \dots, M\}$ and denote by $|B_1|, \dots, |B_K|$
their respective cardinalities. For any $\theta \in \mathbb{R}^M$ consider the unique group
decomposition $\theta = \sum_{k=1}^K \theta^{[k]}$, where $\theta^{[k]} \in \mathbb{R}^M$ has coordinates given by

$\theta^{[k]}_j = \theta_j$ if $j \in B_k$, and $\theta^{[k]}_j = 0$ otherwise.

We also denote by $g(\theta)$ the number of non-null $\theta^{[k]}$ in the group decomposition
of $\theta$. In this setup, it is assumed that there exists $\theta \in \mathbb{R}^M$ such that
$\|f_\theta - \eta\|^2$ is small and such that $g(\theta)$ is small. Moreover, for each $B_k$, there
exists a unique sparsity pattern $p^{[k]} \in \mathcal{P}$ that indicates whether a coordinate is
in $B_k$. Thus, the sparsity pattern $p^{[k]}$ has coordinates $p^{[k]}_j = 1$ if $j \in B_k$ and $0$
otherwise. Let now $J$ be a subset of $\{1, \dots, K\}$ and define the sparsity pattern $p^J$ by

$p^J = \sum_{k \in J} p^{[k]}$.

Consider the set of all such sparsity patterns:

$\mathcal{P}_G = \{p^J,\ J \subset \{1, \dots, K\}\}$.
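
The construction of the group patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`group_pattern`, `all_group_patterns`, `count_active_groups`) and the example partition are hypothetical, and indices are 0-based.

```python
from itertools import combinations


def group_pattern(groups, J, M):
    """Return p^J: ones exactly on the union of the groups B_k, k in J."""
    p = [0] * M
    for k in J:
        for j in groups[k]:
            p[j] = 1
    return p


def all_group_patterns(groups, M):
    """Enumerate (J, p^J) over all 2^K subsets J of the K groups."""
    K = len(groups)
    for size in range(K + 1):
        for J in combinations(range(K), size):
            yield J, group_pattern(groups, J, M)


def count_active_groups(theta, groups):
    """Return g(theta): the number of groups on which theta is not identically zero."""
    return sum(1 for B in groups if any(theta[j] != 0 for j in B))


# Hypothetical partition of {0, ..., 5} into K = 3 groups.
groups = [[0, 1], [2, 3, 4], [5]]
patterns = list(all_group_patterns(groups, M=6))
```

With $K$ groups the enumeration visits $2^K$ patterns rather than $2^M$, which is why the aggregate over $\mathcal{P}_G$ can be much cheaper than the one over the full hypercube.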

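For each sparsity pattern $p^J \in \mathcal{P}_G$, recall that $\hat\theta_{p^J}$ denotes the least squares
estimator constrained to having null coordinates outside of $\bigcup_{k \in J} B_k$:

$\hat\theta_{p^J} \in \operatorname{argmin}_{\theta \in \mathbb{R}^{p^J}} \|Y - f_\theta\|^2$.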

Define the following prior on $\mathcal{P}_G$ that enforces group sparsity:

'1 



where $|J|$ denotes the cardinality of the set $J$. As in (5.4), we obtain

(5.11) $\log\big[(\pi_{p^J})^{-1}\big] \le 2|J| \log\Big(\frac{eK}{|J|}\Big) + \frac{1}{2}$.



We now introduce the sparsity pattern aggregate

(5.12) $\tilde f^G = \sum_{p \in \mathcal{P}_G} \lambda_p f_{\hat\theta_p}$,

where $\lambda_p$ is the exponential weight defined in (4.3) and $\hat\theta_p$ is the least squares
estimator defined in (5.1).

Plugging (5.11) into (5.3) yields the following oracle inequality. 

Corollary 5.2. The group sparsity pattern aggregate $\tilde f^G$ defined in (5.12)
with $\beta = 4\sigma^2$ satisfies

(5.13) lERin< inf lllf,-, f + a^^i^ + ^5(0) log f^) I . 

9&n^' [ n n \9{9) J J 

We see from this corollary that if there exists an ideal "oracle" $\theta$ in $\mathbb{R}^M$ such
that the approximation error $\|f_\theta - \eta\|^2$ is small and $\theta$ is sparse in the sense that
it is supported by a small number of groups, then the sparsity pattern aggregate
mimics the risk of this oracle.

To illustrate the power of the oracle inequality (5.13), we consider the multi-task
learning setup as in Lounici et al. (2010). Namely, we assume that all the
groups are of the same size $T$, and we restrict our analysis to the class $\mathcal{F}_s$ of
regression functions $\eta$ such that $\eta = f_\theta$ for some $\theta$ satisfying $g(\theta) \le s$, where
$s \le K$ is some given integer. Then $|\theta|_0 \le sT$ and (5.13) implies that, uniformly
over $\eta \in \mathcal{F}_s$,



(5.14) $\mathbb{E} R(\tilde f^G) \le \frac{\sigma^2 s}{n}\Big(T + 9\log\Big(\frac{eK}{s}\Big)\Big) + \frac{2\sigma^2}{n}$.

On the other hand, a minimax lower bound on the same class $\mathcal{F}_s$ is available
in Lounici et al. (2010). It has exactly the form of the right-hand side of (5.14), 
cf. equation (6.2) in that paper. This immediately implies that (i) the lower 
bound of Lounici et al. (2010) is tight so that ^ (T + log(^)) is the optimal 
rate of convergence on $\mathcal{F}_s$, and (ii) the estimator $\tilde f^G$ is rate optimal. To our
knowledge, this gives the first example of a rate optimal estimator under group
sparsity. The upper bounds for the Group Lasso estimators in Huang and Zhang
(2010) and Lounici et al. (2010), as well as in the earlier papers cited therein,
depart from this optimal rate at least by a logarithmic factor. Furthermore, they are
obtained under strong assumptions on the dictionary such as restricted isometry
or restricted eigenvalue type conditions, while (5.14) is valid under no assumption
on the dictionary.

6. RELATED PROBLEMS 

In this paper, we have considered only the regression model with fixed design.
Exponentially weighted aggregates are shown to have similar properties
(expressed in terms of sparsity oracle inequalities) in other statistical settings,
namely, in regression with random design, density estimation and classification,
cf. Dalalyan and Tsybakov (2010); Alquier and Lounici (2011). However, the results
differ in several aspects from those of the present paper. First, the estimators
are defined as an average of exponentially weighted aggregates over the
sample sizes from 1 to $n$. This averaging might be of a technical nature but it
has not yet been shown to be superfluous. It is related to earlier work on mirror
averaging, cf. Juditsky et al. (2005); Juditsky, Rigollet and Tsybakov (2008),
which in turn is inspired by the concept of mirror descent in optimization due
to Nemirovskii. Second, the developments in Dalalyan and Tsybakov (2010) and
Alquier and Lounici (2011) start from the general oracle inequalities with
non-discrete priors similar to (3.2), which are sometimes called PAC-bounds. A recent
overview of such bounds can be found in Catoni (2007). Sparsity oracle inequalities
are then derived from these bounds for exponentially weighted aggregates
driven by continuous priors (Dalalyan and Tsybakov, 2010) or by priors with
both discrete and continuous components (Alquier and Lounici, 2011). Finally,
the computational algorithms are also quite different from those that we describe
in the next section. For example, under continuous sparsity priors, one of the
suggestions is to use Langevin Monte-Carlo, cf. Dalalyan and Tsybakov (2009,
2010).

7. NUMERICAL IMPLEMENTATION

All the sparsity pattern aggregates defined in Section 5 are of the
form $f_{\tilde\theta^{\rm exp}}$, where

(7.1) $\tilde\theta^{\rm exp} = \sum_{p \in \mathcal{G}} \lambda_p \hat\theta_p$

for some $\mathcal{G} \subset \mathcal{P}$, where $\lambda_p$ is the exponential weight defined in (4.3), and
$\hat\theta_p$ is either the estimator $\hat\theta_p$ defined in (5.1) or $\hat\theta^D_p$ defined in (5.7).

From (7.1), it is clear that one needs to add up $2^M$ (or $2^K$ in the case of
group sparsity with $K$ groups) least squares estimators to compute $\tilde\theta^{\rm exp}$ exactly.
In many applications this number is prohibitively large. However, most of the
terms in the sum receive an exponentially low weight with the choices of $\pi$ that
we have described. We resort to a numerical approximation that exploits this
fact.

Note that $\tilde\theta^{\rm exp}$ is obtained as the expectation of the random variable $\hat\theta_P$, where
$P$ is a random variable taking values in $\mathcal{P}$ with probability distribution $\nu$ given
by

z^n = = — r , p e G CV . 

Ep'6gexp(-ni2rnfeV)//3)v 

This Gibbs-type distribution can be expressed as the stationary distribution of
the Markov chain generated by the Metropolis-Hastings (MH) algorithm (see,
e.g., Robert and Casella, 2004, Section 7.3). We now describe the MH algorithm
employed here. Note that in the examples considered in Section 5,
$\mathcal{G}$ is either the hypercube $\mathcal{P}$ or the hypercube $\mathcal{P}_G$. For any $p \in \mathcal{G}$, define the
instrumental distribution $q(\cdot \mid p)$ as the uniform distribution on the neighbors of
$p$ in $\mathcal{G}$, and notice that since each vertex has the same number of neighbors, we
have $q(p \mid q) = q(q \mid p)$ for any $p, q \in \mathcal{G}$. The MH algorithm is defined in Figure 1.
We use here the uniform instrumental distribution for the sake of simplicity. Our
simulations show that it yields satisfactory results both in performance and in
speed. Another choice of $q(\cdot \mid \cdot)$ can potentially further accelerate the convergence
of the MH algorithm.



Fix $p_0 = 0 \in \mathbb{R}^M$. For any $t \ge 0$, given $p_t \in \mathcal{G}$,

1. Generate a random variable $Q_t$ with distribution $q(\cdot \mid p_t)$.

2. Generate a random variable

   $P_{t+1} = Q_t$ with probability $r(p_t, Q_t)$,
   $P_{t+1} = p_t$ with probability $1 - r(p_t, Q_t)$,

   where $r(p, q) = \min\big(\nu_q / \nu_p,\ 1\big)$.

3. Compute the least squares estimator $\hat\theta_{P_{t+1}}$.

Figure 1. The Metropolis-Hastings algorithm on the M-hypercube.

From the results of Robert and Casella (2004) (see also Rigollet and Tsybakov,
2011, Theorem 7.1) the Markov chain $(P_t)_{t \ge 0}$ is ergodic. In other words, it holds

$\lim_{T \to \infty} \frac{1}{T} \sum_{t = T_0 + 1}^{T_0 + T} \hat\theta_{P_t} = \sum_{p \in \mathcal{G}} \hat\theta_p \nu_p$, $\nu$-almost surely,

where $T_0 \ge 0$ is an arbitrary integer.

In view of this result, we approximate $\tilde\theta^{\rm exp} = \sum_{p \in \mathcal{G}} \hat\theta_p \nu_p$ by

$\frac{1}{T} \sum_{t = T_0 + 1}^{T_0 + T} \hat\theta_{P_t}$,

which is close to $\tilde\theta^{\rm exp}$ for sufficiently large $T$. One remarkable feature of the
MH algorithm is that it involves only the ratios $\nu_q / \nu_p$, where $p$ and $q$ are two
neighbors in $\mathcal{G}$. Such ratios are easy to compute, at least in the examples given
in Section 5. As a result, the MH algorithm in this case takes the
form of a stochastic greedy algorithm with averaging, which measures a tradeoff
between sparsity and prediction to decide whether to add or remove a variable.
In all subsequent examples, we use a pure R implementation of the sparsity
pattern aggregates. While the benchmark estimators considered below employ
C-based code optimized for speed, we observed that a safe implementation of the
MH algorithm (three times more iterations than needed) exhibits an increase of
computation time of at most a factor of two.
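
The chain and the averaging step can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the R implementation used in the experiments: the function name `mh_pattern_average`, the callback arguments `log_weight` and `estimate`, and the toy target are all hypothetical. The callback `log_weight(p)` is assumed to return $\log \nu_p$ up to an additive constant, so that only ratios $\nu_q / \nu_p$ enter, and `estimate(p)` stands in for the least squares estimator attached to the pattern $p$.

```python
import math
import random


def mh_pattern_average(log_weight, estimate, M, burn_in, T, seed=0):
    """Average estimate(P_t) over t = burn_in + 1, ..., burn_in + T along an MH chain.

    The proposal flips one uniformly chosen coordinate, i.e. it is uniform on
    the neighbors of the current pattern in the M-hypercube.
    """
    rng = random.Random(seed)
    p = [0] * M                      # start from the empty pattern p_0 = 0
    lw_p = log_weight(p)
    total = [0.0] * len(estimate(p))
    for t in range(burn_in + T):
        j = rng.randrange(M)
        q = p[:]
        q[j] = 1 - q[j]              # neighbor of p: add or remove variable j
        lw_q = log_weight(q)
        # Accept with probability r(p, q) = min(nu_q / nu_p, 1).
        if lw_q >= lw_p or rng.random() < math.exp(lw_q - lw_p):
            p, lw_p = q, lw_q
        if t >= burn_in:
            for i, v in enumerate(estimate(p)):
                total[i] += v
    return [s / T for s in total]


# Hypothetical toy target: log nu_p = sum_j a_j p_j, a product distribution
# whose exact mean is known in closed form, used only to check the sampler.
a = [2.0, -1.0, 0.5]
approx = mh_pattern_average(
    log_weight=lambda p: sum(ai * pi for ai, pi in zip(a, p)),
    estimate=lambda p: [float(x) for x in p],
    M=len(a), burn_in=1000, T=50000,
)
exact = [1.0 / (1.0 + math.exp(-ai)) for ai in a]
```

On this toy target the averaged output `approx` should be close to `exact`. In an actual application `log_weight` would combine the fit of the least squares estimator on the pattern with the log-prior, and `estimate` would return that estimator.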

7.1 Numerical experiments 

The aim of this subsection is to illustrate the performance of the sparsity pattern
aggregates defined in (5.5) and (5.9) on a simulated
dataset and to compare it with state-of-the-art procedures in sparse estimation.
In our implementation, we replace the prior $\pi$ by the exponential screening prior
employed in Rigollet and Tsybakov (2011). As a result, the following results are
about the exponential screening (ES) aggregate defined in Rigollet and Tsybakov
(2011). Nevertheless, it presents the same qualitative behavior as the aggregates
constructed above.

While our results for the ES estimator hold under no assumption on the dictionary,
we compare the behavior of our algorithm in a well-known example where
sparse estimation by $\ell_1$-penalized techniques is theoretically achievable.

Consider the model $Y = X\theta^* + \sigma\xi$, where $X$ is an $n \times M$ matrix with independent
standard Gaussian entries and $\xi \in \mathbb{R}^n$ is a vector of independent standard
Gaussian random variables and is independent of $X$. Depending on our sparsity
assumption, we choose two different $\theta^*$.

The variance is chosen as $\sigma^2 = \|f_{\theta^*}\|^2/9 = |X\theta^*|_2^2/(9n)$ following the numerical
experiments of Candes and Tao (2007, Section 4). For different values of
$(n, M, S)$, we run the ES algorithm on 500 replications of the problem and compare
our results with several other popular estimators in the literature on sparse
estimation that are readily implemented in R. The considered estimators are:
1. The Lasso estimator with regularization parameter obtained by ten-fold
cross-validation;

2. The MC+ estimator of Zhang (2010) with regularization parameter obtained
by ten-fold cross-validation;

3. The SCAD estimator of Fan and Li (2001) with regularization parameter
obtained by ten-fold cross-validation.

The Lasso estimator is calculated using the glmnet package in R (Friedman, Hastie and Tibshirani,
2010). The cross-validated MC+ and SCAD estimators are implemented in the
ncvreg package in R (Breheny and Huang, 2011).

The performance of each of the four estimators, generically denoted by $\hat\theta$, is
measured by its prediction error $|X(\hat\theta - \theta^*)|_2^2/n = \|f_{\hat\theta} - f_{\theta^*}\|^2$. Moreover, even
though the estimation error $|\hat\theta - \theta^*|_2^2$ is not studied above, we also report its
values for a better comparison with other simulation studies.
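
For concreteness, the noise level and the prediction error used in these experiments can be computed as in the following Python sketch. The helper names and the tiny design matrix are hypothetical and serve only to make the two formulas explicit.

```python
def mat_vec(X, v):
    """Return the matrix-vector product X v for X given as a list of rows."""
    return [sum(x_ij * v_j for x_ij, v_j in zip(row, v)) for row in X]


def noise_variance(X, theta_star):
    """Return sigma^2 = |X theta*|_2^2 / (9 n)."""
    n = len(X)
    return sum(y * y for y in mat_vec(X, theta_star)) / (9.0 * n)


def prediction_error(X, theta_hat, theta_star):
    """Return |X (theta_hat - theta*)|_2^2 / n."""
    n = len(X)
    diff = [a - b for a, b in zip(theta_hat, theta_star)]
    return sum(y * y for y in mat_vec(X, diff)) / n


# Hypothetical toy design with n = 3 observations and M = 2 variables.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
theta_star = [1.0, 1.0]
sigma2 = noise_variance(X, theta_star)                   # 9 / 27
pred_err = prediction_error(X, [1.0, 0.0], theta_star)   # 5 / 3
```

Dividing the signal energy by 9 fixes the ratio between the signal and noise standard deviations at 3, whatever the scale of the design.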

7.1.1 Coordinatewise sparsity. The vector $\theta^*$ is given by $\theta^*_j = 1\!\mathrm{I}(j \le S)$ for
some fixed $S$ so that $M(\theta^*) = S$.

We considered the cases $(n, M, S) \in \{(100, 200, 10), (200, 500, 20)\}$. The Metropolis
approximation was computed with $T_0 = 3{,}000$, $T = 7{,}000$, which should
be in the asymptotic regime of the Markov chain since Figure 3 shows that,
on a typical example, the right sparsity pattern is recovered after about 2,000
iterations.

Figure 2 displays comparative boxplots and Table 1 reports averages and standard
deviations over the 500 repetitions. In particular, it shows that ES outperforms
the Lasso estimator and has performance similar to MC+ and SCAD.



(M, n, S)         ES            Lasso         MC+           SCAD
(200, 100, 10)    0.14 (0.11)   0.82 (0.28)   0.18 (0.17)   0.17 (0.15)
(500, 200, 20)    0.29 (0.16)   1.78 (0.43)   0.31 (0.14)   0.29 (0.12)

(M, n, S)         ES            Lasso         MC+           SCAD
(200, 100, 10)    0.12 (0.08)   0.50 (0.15)   0.15 (0.10)   0.14 (0.10)
(500, 200, 20)    0.25 (0.11)   1.02 (0.22)   0.27 (0.11)   0.26 (0.10)

Table 1. Means and standard deviations (in parentheses) of performance measures over
500 realizations for the ES, Lasso, MC+ and SCAD estimators. Top: estimation performance
$|\hat\theta - \theta^*|_2^2$. Bottom: prediction performance $|X(\hat\theta - \theta^*)|_2^2/n$.

Figure 3 illustrates a typical behavior of the ES estimator for one particular
realization of $X$ and $\xi$. For better visibility, both displays represent only the 50
first coordinates of the vectors shown, with $T_0 = 3{,}000$, $T = 7{,}000$. The left-hand side display
shows that the sparsity pattern is well recovered and the estimated values are close
to one. The right-hand side display illustrates the evolution of the intermediate
parameter $\hat\theta_{P_t}$ for $t = 1, \dots, 5000$. It is clear that the Markov chain that runs on
the M-hypercube graph gets trapped in the vertex that corresponds to the sparsity
pattern of $\theta^*$ after only 2,000 iterations. As a result, while the ES estimator is
not sparse itself, the MH approximation to the ES estimator may output a sparse
solution.

Fused sparsity. The vector $\theta^*$ is chosen piecewise constant as follows. Fix
an integer $S \ge 1$ such that $10S \le M$ and consider the blocks $I_1, \dots, I_S$ defined
by

$I_j = \{10(j-1) + 1, \dots, 10j\}, \quad j = 1, \dots, S$.

The vector $\theta^*$ is defined to take value $(-1)^j$ on $I_j$, $j = 1, \dots, S$, and $1/2$ elsewhere.
We considered the cases $(M, n, S) \in \{(200, 100, 10), (500, 200, 20)\}$ that
are illustrated in Figure 5. Note that in both cases, the vector $\theta^*$ is not sparse.

Figure 2. Boxplots of performance measure over 500 realizations for the ES, Lasso, MC+ and
SCAD estimators. Top: estimation performance $|\hat\theta - \theta^*|_2^2$. Bottom: prediction performance
$|X(\hat\theta - \theta^*)|_2^2/n$. Left: $(M, n, S) = (200, 100, 10)$. Right: $(M, n, S) = (500, 200, 20)$.
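
The piecewise constant target described above can be built as in the following Python sketch. The function name `fused_theta_star` is hypothetical, positions are 0-based in the code, and the check on $10S$ versus $M$ allows equality, which is an assumption about the exact condition.

```python
def fused_theta_star(M, S):
    """Return theta*: value (-1)^j on block I_j (j = 1..S), and 1/2 elsewhere."""
    if S < 1 or 10 * S > M:
        raise ValueError("need S >= 1 and 10 * S <= M")
    theta = [0.5] * M
    for j in range(1, S + 1):
        for i in range(10 * (j - 1), 10 * j):   # 0-based positions of I_j
            theta[i] = (-1.0) ** j
    return theta


theta_star = fused_theta_star(M=200, S=10)
# Number of jumps, i.e. positions where consecutive coordinates differ.
n_jumps = sum(1 for i in range(1, len(theta_star))
              if theta_star[i] != theta_star[i - 1])
n_nonzero = sum(1 for x in theta_star if x != 0)
```

For $(M, S) = (200, 10)$ every coordinate is nonzero while the number of jumps stays small, which is the regime the fused procedures are meant to exploit.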

The fused versions of Lasso, MC+ and SCAD are not readily available in R and
we implement them as follows. Recall that $D$ is the $M \times M$ matrix defined in
Subsection 5.2 by $(D\theta)_1 = \theta_1$ and $(D\theta)_j = \theta_j - \theta_{j-1}$ for $j = 2, \dots, M$. The inverse
$D^{-1}$ is the $M \times M$ lower triangular matrix with ones on the diagonal and in the
lower triangle. To obtain the fused versions of Lasso, MC+ and SCAD, we simply
run these algorithms on the design matrix $XD^{-1}$ to obtain a solution $\hat\theta$. We then
return the vector $D^{-1}\hat\theta$ as a solution to the fused problem.
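
The change of variables behind this recipe can be checked numerically. The Python sketch below is only an illustration: the helper names are hypothetical, and the vector `u_hat` is a made-up stand-in for the sparse solution that a Lasso, MC+ or SCAD solver would return on the transformed design.

```python
def first_difference_matrix(M):
    """Return D with (D theta)_1 = theta_1 and (D theta)_j = theta_j - theta_{j-1}."""
    D = [[0.0] * M for _ in range(M)]
    for j in range(M):
        D[j][j] = 1.0
        if j > 0:
            D[j][j - 1] = -1.0
    return D


def inverse_first_difference_matrix(M):
    """Return the lower triangular matrix of ones, which inverts D."""
    return [[1.0 if k <= j else 0.0 for k in range(M)] for j in range(M)]


def mat_mul(A, B):
    """Return the matrix product A B for matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


M = 4
D = first_difference_matrix(M)
D_inv = inverse_first_difference_matrix(M)
identity = mat_mul(D, D_inv)

X = [[1.0, 2.0, 3.0, 4.0]]                 # toy design with a single observation
X_fused = mat_mul(X, D_inv)                # column j sums columns j, ..., M of X
u_hat = [1.0, 0.0, 2.0, 0.0]               # stand-in for a sparse solver output
theta_hat = [row[0] for row in mat_mul(D_inv, [[u] for u in u_hat])]
```

Here `theta_hat` is the cumulative sum of `u_hat`, so a sparse `u_hat` maps back to a piecewise constant vector, and the fitted values agree in both parametrizations since $X\theta = (XD^{-1})(D\theta)$.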

We report the boxplots of the two performance measures $|X(\hat\theta - \theta^*)|_2^2/n$ and
$|\hat\theta - \theta^*|_2^2$ in Figure 4. It is clear that, in this example, Exponential Screening
outperforms the three other estimators. Moreover, MC+ and SCAD perform particularly
poorly in the case $(M, n, S) = (500, 200, 20)$. Their output on a typical
example is illustrated in Figure 5. We can see that they yield an estimator that
takes only two values, thus missing most of the structure of the problem. It seems
that this behavior can be explained by the fact that the estimators are trapped
in a local minimum close to zero.

Figure 3. Typical realization for $(M, n, S) = (500, 200, 20)$. Left: value of the MH approximation
of the ES estimator, $T_0 = 3{,}000$, $T = 7{,}000$ (horizontal axis: index). Right: value of
$\hat\theta_{P_t}$ for $t = 1, \dots, 5000$ (horizontal axis: iteration). Only the first 50 coordinates are
shown for each vector.

Figure 4. Boxplots of performance measure over 500 realizations for the Fused-ES, Fused-Lasso,
Fused-MC+ and Fused-SCAD estimators. Top: estimation performance $|\hat\theta - \theta^*|_2^2$. Bottom:
prediction performance $|X(\hat\theta - \theta^*)|_2^2/n$. Left: $(M, n, S) = (200, 100, 10)$. Right:
$(M, n, S) = (500, 200, 20)$.

Figure 5. Typical realizations of the fused estimators in the cases $(M, n, S) = (200, 100, 10)$
(top) and $(M, n, S) = (500, 200, 20)$ (bottom). [Panel titles: Fused-ES, Fused-MC+, Fused-SCAD]

8. APPENDIX

The proof of (2.2) is standard, and similar results have been formulated in the
literature for various other setups. We give it here for the sake of completeness.
From the definition of the empirical risk minimizer $\hat f^{\rm erm}$, we have

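where $f^*$ is any minimizer of the true risk $R(\cdot)$ over $\mathcal{H}$. Simple algebra yields

$R(\hat f^{\rm erm}) \le R(f^*) + 2\langle \hat f^{\rm erm} - f^*, Y - \eta \rangle$,

where for two functions $f, g$ from $\mathcal{X}$ to $\mathbb{R}$ we define $\langle f, g \rangle = \frac{1}{n} \sum_{i=1}^n f(x_i) g(x_i)$.
Next, observe that

$\mathbb{E}\langle \hat f^{\rm erm} - f^*, Y - \eta \rangle \le \mathbb{E} \max_{1 \le j \le M} \langle f_j - f^*, Y - \eta \rangle \le 2\sigma \sqrt{\frac{2 \log M}{n}}$,

where we used the fact that $\|f^* - f_j\| \le 2$ and the inequality $\mathbb{E}[\max_{1 \le j \le M} \zeta_j] \le
\sqrt{2 \log M}$ valid for $M$ standard Gaussian random variables $\zeta_j$.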
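
The Gaussian maximal inequality used in the last step is easy to check by simulation. The following Python sketch is a hedged illustration only: the function name, the choice $M = 50$, the number of repetitions and the seed are all arbitrary assumptions, and independent variables are simulated although the inequality does not require independence.

```python
import math
import random


def mean_max_gaussian(M, reps, seed=0):
    """Monte Carlo estimate of E[max_{1 <= j <= M} zeta_j] for i.i.d. N(0, 1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.gauss(0.0, 1.0) for _ in range(M))
    return total / reps


M = 50
estimate = mean_max_gaussian(M, reps=2000)
bound = math.sqrt(2.0 * math.log(M))   # the upper bound sqrt(2 log M)
```

The simulated mean should fall below `bound`, with a gap that reflects the fact that $\sqrt{2 \log M}$ is only sharp asymptotically.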

We now turn to the proof of (2.4). Consider the random matrix $X$ of size $n \times M$
such that its elements $X_{i,j}$, $i = 1, \dots, n$, $j = 1, \dots, M$, are i.i.d. Rademacher
random variables, i.e., random variables taking values 1 and $-1$ with probability
1/2. Moreover, assume that

for some positive constant $C_1 < 1/2$. Note that (8.1) follows from (2.3) if $C_0$ is
chosen small enough. Theorem 5.2 in Baraniuk et al. (2008) (see also Subsection
5.2.1 in Rigollet and Tsybakov, 2011) entails that if (8.1) holds for $C_1$ small
enough, then there exists a nonempty set $\mathcal{M}$ of matrices obtained as realizations
of the matrix $X$ that enjoy the following weak restricted isometry (wRI) property.
For any $X \in \mathcal{M}$, there exist constants $\bar\kappa \ge \underline\kappa > 0$, such that for any $\lambda \in \mathbb{R}^M$
with at most 2 nonzero coordinates,

(8.2) $\underline\kappa^2 |\lambda|_2^2 \le \frac{|X\lambda|_2^2}{n} \le \bar\kappa^2 |\lambda|_2^2$

when (8.1) is satisfied. Here $|\cdot|_2$ denotes the $\ell_2$ norm. For $X \in \mathcal{M}$, let $\phi_1, \dots, \phi_M$
be any functions on $\mathcal{X}$ satisfying

$\phi_j(x_i) = X_{ij}, \quad i = 1, \dots, n, \quad j = 1, \dots, M$,

where $X_{ij}$ are the entries of $X$. Note that $\|\phi_j\| = 1$ since $X_{ij} \in \{-1, 1\}$.
Fix $\tau > 0$ to be chosen later and set

$f_j = \tau(1 + \alpha)\phi_j, \quad j = 1, \dots, M$,

where we set for brevity a = (c/S) y''^^^- Moreover, consider the functions 

$\eta_j = \tau\alpha\phi_j, \quad j = 1, \dots, M$.

Using (2.3) we choose $\tau$ small enough to ensure that $\|\eta_j\| \le 1$ and $\|f_j\| \le 1$ for
any $j = 1, \dots, M$.

We write $R_j(\cdot)$ to denote the risk function $R(\cdot)$ when $\eta = \eta_j$ in (1.1). It is easy
to check that

(8.3) $\min_{f \in \mathcal{H}} R_j(f) = R_j(f_j) = \|f_j - \eta_j\|^2$.

As is customary in the proof of minimax lower bounds, we reduce our estimation
problem to a testing problem as follows. Let $\psi \in \{1, \dots, M\}$ be the random
variable, or test, defined by $\psi = j$ if and only if $S_n = f_j$. Then, $\psi \ne j$ implies
that there exists $k \ne j$ such that $S_n = f_k$, so that

$\|S_n - \eta_j\|^2 - \|f_j - \eta_j\|^2 = \|f_k - f_j\|^2 + 2\langle f_k - f_j, f_j - \eta_j \rangle$

$= \tau^2(1 + \alpha)^2 \|\phi_j - \phi_k\|^2 + 2\tau^2(1 + \alpha)(\langle \phi_j, \phi_k \rangle - 1)$.

From (8.2), we find that $\|\phi_j - \phi_k\|^2 \ge 2\underline\kappa^2$ so that



HQ l|2 llf ||2 ^ ^T^kV / logM ^ 

\\Sn-r]j\\ -\\fj-r]j\\ > ^_ ^——=Un,M- 
Therefore, we conclude that $\psi \ne j$ implies that

$R_j(S_n) - \min_{f \in \mathcal{H}} R_j(f) \ge \nu_{n,M}$.

Hence,

(8.4) $\max_{1 \le j \le M} P_j\Big\{ R_j(S_n) - \min_{f \in \mathcal{H}} R_j(f) \ge \nu_{n,M} \Big\} \ge \inf_\psi \max_{1 \le j \le M} P_j(\psi \ne j)$,

where the infimum is taken over all tests taking values in $\{1, \dots, M\}$ and $P_j$ denotes
the joint distribution of $Y_1, \dots, Y_n$ that are independent Gaussian random
variables with means $\eta_j(x_i)$, $i = 1, \dots, n$, respectively. It follows from Tsybakov (2009,
Proposition 2.3 and Theorem 2.5) that if for any $1 \le j, k \le M$, the Kullback-Leibler
divergence between $P_j$ and $P_k$ satisfies

logM 

(8.5) /C(P„P,)<^, 
then there exists a constant $C > 0$ such that

(8.6) $\inf_\psi \max_{1 \le j \le M} P_j(\psi \ne j) \ge C$.

To check (8.5), observe that, choosing $\tau \le 1$ and applying (8.2), we get
r^P P\ ^11 i|2 I^]osi^iu ^ ii2 ^ iQg^ 

Therefore, in view of (8.4) and (8.6), we find using the Markov inequality that
for any selector $S_n$,



max Ej 



Rj{Sn) - mm Rj{f) 



n 

where $E_j$ denotes the expectation with respect to $P_j$.